(57) Abstract: A method for improving the reliability and/or accuracy of physical measurements obtained from array hybridization

PROCESS FOR REMOVING SYSTEMATIC ERROR AND OUTLIER DATA AND
FOR ESTIMATING RANDOM ERROR IN CHEMICAL AND BIOLOGICAL ASSAYS



Field Of The Invention 

The present invention relates to a process for making evaluations
which objectify analyses of data obtained from hybridization arrays.
The present invention in one aspect is a process for removing
systematic error present in replicate genomic samples. A second aspect
is a process for detecting and deleting extreme value data points
(outliers). A third aspect is an optimization process for detecting
and removing extreme value data points (outliers). A fourth aspect is
a process for estimating the extent of random error present in
replicate genomic samples composed of small numbers of data points.

Background Of The Invention 

Array-based genetic analyses start with a large library of cDNAs or
oligonucleotides (probes), immobilized on a substrate. The probes are
hybridized with a single labeled sequence, or a labeled complex mixture
derived from a tissue or cell line messenger RNA (target). As used
herein, the term "probe" will therefore be understood to refer to
material tethered to the array, and the term "target" will refer to
material that is applied to the probes on the array, so that
hybridization may occur.

The term "element" will refer to a spot on an array. Array elements
reflect probe/target interactions. The term "background" will refer to
area on the substrate outside of the elements.

The term "replicates" will refer to two or more measured values of the
same probe/target interaction. Replicates may be independent (the
measured values are independent) or dependent (the measured values are
related, statistically correlated, or reaction paired). Replicates may
be within arrays, across arrays, within experiments, across
experiments, or any combination thereof.

Measured values of probe/target interactions are a function of their
true values and of measurement error. The term "outlier" will refer to
an extreme value in a distribution of values. Outlier data often
result from uncorrectable measurement errors and are typically deleted
from further statistical analysis.

There are two kinds of error, random and systematic, which affect the
extent to which observed (measured) values deviate from their true
values.

Random errors produce fluctuations in observed values of the same
process or attribute. The extent and the distributional form of random
errors can be detected by repeated measurements of the same process or
attribute. Low random error corresponds to high precision.

Systematic errors produce shifts (offsets) in measured values.
Measured values with systematic errors are said to be "biased".
Systematic errors cannot be detected by repeated measurements of the
same process or attribute because the bias affects the repeated
measurements equally. Low systematic error corresponds to high
accuracy. The terms "systematic error", "bias", and "offset" will be
used interchangeably in the present document.
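
The distinction between random and systematic error can be illustrated
with a small simulation. The sketch below is only an illustration under
stated assumptions: an additive constant bias, Gaussian random error,
and hypothetical names and values (simulate_measurements, a true value
of 100) that do not come from the source. It shows that the spread of
replicates recovers the random error, while the bias can be computed
only because the simulation knows the true value.

```python
import random
import statistics

def simulate_measurements(true_value, bias, noise_sd, n, seed=0):
    """Return n replicates: true value + constant bias + Gaussian noise (assumed model)."""
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

true_value = 100.0  # known only because this is a simulation
replicates = simulate_measurements(true_value, bias=5.0, noise_sd=2.0, n=1000)

# Random error: visible in the spread of repeated measurements.
spread = statistics.stdev(replicates)

# Systematic error: every replicate is shifted equally, so the shift
# cannot be seen from the replicates alone; it needs the true value.
offset = statistics.mean(replicates) - true_value
```

In real array data the true value is unknown, which is why the offset
has to be estimated from a model rather than read off the replicates.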

An invention for estimating random error present in replicate genomic
samples composed of small numbers of data points has been described by
Ramm and Nadon in "Process for Evaluating Chemical and Biological
Assays", International Application No. PCT/IB99/00734. In a preferred
embodiment, the process described therein assumed that, prior to
conducting statistical tests, systematic error in the measurements had
been removed and that outliers had been deleted.

In accordance with one aspect, the present invention is a process that
estimates and removes systematic error from measured values. In
another aspect, it is a process for optimizing outlier detection and
deletion. A second aspect is a process for detecting and deleting
outliers. A third aspect is a process for optimizing outlier detection
and deletion automatically. A fourth aspect is a process for
estimating the extent of random error present in replicate genomic
samples composed of small numbers of data points.

There are two types of systematic error potentially present in
hybridization arrays.

Array elements may be offset within arrays. Typically, this offset is
additive. It can derive from various sources, including distortions in
the nylon membrane substrate (Duggan, Bittner, Chen, Meltzer, & Trent,
"Expression profiling using cDNA microarrays", Nature Genetics, 21,
10-14 (1999)).

If present, the offset is corrected by a procedure called "background
correction", which involves subtracting from the array element the
intensity of a background area outside of the element.

Areas used for calculation of background can be close to the array
element (e.g., a circle lying around the element), or distant (a
rectangle lying around the entire array). Because offset within an
array tends to be specific to individual array elements (even with
relatively uniform background), areas close to the element are
generally preferred for background correction.

Alternatively, background estimates can be obtained from "blank"
elements (i.e., elements without probe material). In this procedure,
"background" is defined differently from the more typical method
described in the previous paragraph. Theoretically, blank element
intensities are affected by the same error factors that affect
non-element background areas (e.g., washing procedures) and also by
error factors which affect element quantification but which are
extraneous to the biological signal of interest (e.g., dispensing
errors).

The present invention does not address the issue of background
correction. In a preferred embodiment, background correction, as
necessary, has been applied prior to estimation of systematic error
and outlier detection. In a non-preferred embodiment, the process may
still be applied to arrays which have not been corrected for
background offset.

In one aspect, the present invention is a process for estimating and
removing systematic error across arrays. Contrary to background
offset, offset across arrays tends to be proportional.

Offset across arrays can derive from various sources. For microarray
studies which use fluorescent labelling, offset factors include target
quantity, extent of target labelling, fluor excitation and emission
efficiencies, and detector efficiency. These factors may influence all
elements equally or may in part be specific to element subsets of the
array. For example, quantity of target material may be offset
differently for different robotic arrayer spotting pin locations (see
Bowtell, "Options available - from start to finish - for obtaining
expression data by microarray", Nature Genetics, 21, 25-32, p. 31
(1999)).

For radio-labelled macro array studies, proportional offset factors
include target quantity and target accessibility (Perret, Ferrán,
Marinx, Liauzun, et al., "Improved differential screening approach to
analyse transcriptional variations in organized cDNA libraries", Gene,
208, 103-115 (1998)).

Time of day that arrays are processed (Lander, "Array of hope", Nature
Genetics, 21, 3-4 (1999)) and variations in chemical wash procedures
across experiments (Shalon, Smith, & Brown, "A DNA microarray system
for analyzing complex DNA samples using two-color fluorescent probe
hybridization", Genome Research, 6, 639-645 (1996)) have also been
cited as offset factors.

Prior art methods for removing systematic error are called
"normalization" procedures. These procedures involve dividing array
element values by a reference value. This reference can be based on
all probes or on a subset (e.g., "housekeeping genes" whose
theoretical expression levels do not change across conditions).
However obtained, the reference can be estimated by one of various
summary values (e.g., mean or a specified percentile).
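
A prior art normalization of this kind can be sketched as follows. This
is a minimal illustration, not a reproduction of any particular
published procedure: the function names (percentile, normalize), the
linear-interpolation percentile rule, and the example intensities are
all assumptions made for the sketch.

```python
import statistics

def percentile(values, pct):
    """pct-th percentile (0-100), linear interpolation between order statistics."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (pct / 100.0) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

def normalize(array_values, reference="mean", pct=75.0):
    """Divide every element of one array by a single reference value."""
    if reference == "mean":
        ref = statistics.mean(array_values)
    elif reference == "percentile":
        ref = percentile(array_values, pct)
    else:
        raise ValueError("reference must be 'mean' or 'percentile'")
    return [v / ref for v in array_values]

# Hypothetical intensities: array_b is array_a with a proportional offset of 2.
array_a = [10.0, 20.0, 30.0, 40.0, 50.0]
array_b = [2.0 * v for v in array_a]

norm_a = normalize(array_a, reference="percentile", pct=75.0)
norm_b = normalize(array_b, reference="percentile", pct=75.0)
```

With a purely proportional offset and no random error the two
normalized arrays coincide; the choice of reference only starts to
matter once random error and outliers are present.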

Once systematic error has been removed, any remaining measurement
error is, in theory, random. Random error reflects the expected
statistical variation in a measured value. A measured value may
consist, for example, of a single value, a summary of values (mean,
median), a difference between single or summary values, or a
difference between differences. In order for two values to be
considered significantly different from each other, their difference
must exceed a threshold defined jointly by the measurement error
associated with the difference and by a specified probability of
concluding erroneously that the two values differ (Type I error rate).
Statistical tests are conducted to determine if values differ
significantly from each other.

All prior art normalization procedures, however, estimate systematic
error outside of the context of a statistical model. Because these
informal procedures make implicit (and often incorrect) assumptions
about the structure of the data (e.g., form and extent of both
systematic and random error), they often fail to adequately eliminate
measurement bias and can introduce additional bias due to the
normalization procedure itself. In a different scientific context,
Freedman and Navidi, in "Regression models for adjusting the 1980
census", Statistical Science, 1, 3-11 (1986), described the problems
inherent in failing to correctly model data that contain measurement
error ("uncertainty" in their terminology):

    Models are often used to decide issues in situations marked by
    uncertainty. However, statistical inferences from data depend on
    assumptions about the processes which generated those data. If the
    assumptions do not hold, the inferences may not be reliable either.
    This limitation is often ignored by applied workers who fail to
    identify crucial assumptions or subject them to any kind of
    empirical testing. In such circumstances, using statistical
    procedures may only compound the uncertainty (p. 3).

In addition to correct removal of systematic error, many statistical
tests require the assumption that residuals be normally distributed.
Residuals reflect the difference between values' estimated true scores
and their observed (measured) scores. If a residual score is extreme
(relative to other scores in the distribution), it is called an
outlier. An outlier is typically removed from further statistical
analysis because it generally indicates that the measured value
contains excessive measurement error that cannot be corrected. In
order to achieve normally distributed residuals, data transformation
is often necessary (e.g., log transform).

In one aspect, the present invention is a process for detecting and
removing outliers by examining the distribution of residuals. In
another aspect, it is a process for detecting and removing outliers
automatically through an iterative process which examines
characteristics of the distribution of residuals (e.g., skewness,
kurtosis).

As with correction for offset across arrays (normalization), prior art
for outlier detection relies on informal and arbitrary procedures
outside of the context of a statistical model. For example, Perret,
Ferrán, Marinx, Liauzun, et al., "Improved differential screening
approach to analyse transcriptional variations in organized cDNA
libraries", Gene, 208, 103-115 (1998), compared the intensity of sets
of two replicate array elements after normalization. Any replicate set
that showed a greater than 2-fold difference (or equivalently, less
than a 0.5-fold difference) was regarded as an outlier.
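
The fold-difference rule described above can be written in a few lines.
The sketch below is only an illustration of that rule as summarized in
this document; the function name is_fold_outlier and the example
intensities are hypothetical, and positive intensities are assumed.

```python
def is_fold_outlier(rep1, rep2, fold=2.0):
    """Flag a pair of replicate intensities whose ratio exceeds `fold`
    or falls below 1/fold. Assumes both intensities are positive."""
    ratio = rep1 / rep2
    return ratio > fold or ratio < 1.0 / fold

# Hypothetical normalized replicate pairs.
pairs = [(100.0, 40.0), (100.0, 60.0), (40.0, 100.0)]
flags = [is_fold_outlier(a, b) for a, b in pairs]
```

The cutoff of 2 is fixed in advance and does not depend on how variable
the data actually are, which is the arbitrariness the text objects to.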

In accordance with one aspect, the present invention is a process for
estimating the extent of random error present in replicate genomic
samples composed of small numbers of data points and for conducting a
statistical test comparing expression level across conditions (e.g.,
diseased versus normal tissue). It is an alternative to the method
described by Ramm and Nadon in "Process for Evaluating Chemical and
Biological Assays", International Application No. PCT/IB99/00734. As
such, it can be used in addition to (or in place of) the procedures
described by Ramm and Nadon (Ibid).

Disadvantages of all prior art procedures include:

1. The value chosen as a normalization reference (e.g., 75th
   percentile, etc.) is arbitrary;
2. Given that the choice of normalization reference is arbitrary,
   dividing by the reference value overcorrects some elements and
   undercorrects others;
3. Because prior art procedures do not estimate systematic error
   within the context of a statistical model, data transformations
   that are necessary for correct inferences may not be performed or
   may be applied incorrectly;
4. Because prior art procedures do not estimate systematic error
   within the context of a statistical model, normalization can alter
   the true structure of the data;
5. Because prior art procedures do not detect outliers within the
   context of a statistical model, true outliers may go undetected and
   non-outliers may be incorrectly classified as outliers;
6. Classification of values as outliers or not is arbitrary and
   subjective;
7. Theoretical assumptions about data structure (e.g., that residuals
   are normally distributed) are not examined empirically;
8. Normalization procedures may create additional measurement error
   that is not present in the original non-normalized measurements.

The term "treatment condition" will refer to an effect of interest.
Such an effect may pre-exist (e.g., differences across different
tissues or across time) or may be induced by an experimental
manipulation.

Hybridization arrays produced under different treatment conditions may
be statistically dependent or independent. Microarray technology in
which two different target treatment samples are labelled with
different fluors and are then cohybridized onto each arrayed element
represents one example of statistical dependence. Typically,
expression ratios of the raw signals generated by the two fluors are
examined for evidence of differences across treatment conditions.

Chen, Dougherty, & Bittner, "Ratio-based decisions and the
quantitative analysis of cDNA microarray images", Journal of
Biomedical Optics, 2, 364-374 (1997), have presented an analytical
mathematical approach that estimates the distribution of
non-replicated differential ratios under the null hypothesis. This
approach is similar to the present invention in that it derives a
method for obtaining confidence intervals and probability estimates
for differences in probe intensities across different conditions. It
differs from the present invention in how it obtains these estimates.
Unlike the present invention, the Chen et al. approach does not obtain
measurement error estimates from replicate probe values. Instead, the
measurement error associated with ratios of probe intensities between
conditions is obtained via mathematical derivation of the null
hypothesis distribution of ratios. That is, Chen et al. derive what
the distribution of ratios would be if none of the probes showed
differences in measured values across conditions that were greater
than would be expected by "chance." Based on this derivation, they
establish thresholds for statistically reliable ratios of probe
intensities across two conditions. The method, as derived, is
applicable to assessing differences across two conditions only.
Moreover, it assumes that the measurement error associated with probe
intensities is normally distributed. The method, as derived, cannot
accommodate other measurement error models (e.g., lognormal). It also
assumes that all measured values are unbiased and reliable estimates
of the "true" probe intensity. That is, it is assumed that none of the
probe intensities are "outlier" values that should be excluded from
analysis. Indeed, outlier detection is not possible with the approach
described by Chen et al.

The present invention applies the processes described by Ramm and
Nadon in "Process for Evaluating Chemical and Biological Assays",
International Application No. PCT/IB99/00734, and by Ramm, Nadon and
Shi in "Process for Removing Systematic Error and Outlier Data and for
Estimating Random Error in Chemical and Biological Assays",
Provisional Application No. 60/139,639 (1999), to two or more
statistically dependent genomic samples.

The present invention differs from prior art in that:

1. It can accommodate various measurement error models (e.g.,
   lognormal);
2. It can detect outliers within the context of a statistical model;
3. It can be used to examine theoretical assumptions about data
   structure (e.g., that residuals are normally distributed).

Detailed Description Of The Preferred Embodiment 

Suppose, for example, that expression levels for a particular data set
have proportional systematic and proportional random error across
replicate arrays. This scenario is represented symbolically in
Equation 1:

$$X_{gij} = \mu_{gi}\, v_{gj}\, \varepsilon_{gij} \qquad (1)$$

for $g = 1, \ldots, G$, $j = 1, \ldots, m$ and $i = 1, \ldots, n$,
where $\mu_{gi}$ represents the associated true intensity value of
array element $i$ (which is unknown and fixed), $v_{gj}$ represents
the unknown systematic shifts or offsets across replicates, and
$\varepsilon_{gij}$ represents the observed random errors in a given
condition $g$ for spot $i$ and replicate $j$. The interest lies in
obtaining an unbiased estimate of an element's "true" value
($\mu_{gi}$).

Given condition $g$ (e.g., normal cells or diseased counterparts),
array element $i$, and replicate $j$, the associated intensity value
is denoted as $X_{gij}$.

Alternatively, a model with additive offset and additive random error
would be symbolized by:

$$X_{gij} = \mu_{gi} + v_{gj} + \varepsilon_{gij} \qquad (2)$$

for $g = 1, \ldots, G$, $j = 1, \ldots, m$ and $i = 1, \ldots, n$,
where $\mu_{gi}$ represents the associated true intensity value of
array element $i$ (which is unknown and fixed), $v_{gj}$ represents
the unknown systematic shifts or offsets across replicates, and
$\varepsilon_{gij}$ represents the observed random errors in a given
condition $g$ for element $i$ and replicate $j$. The interest lies in
obtaining an unbiased estimate of an element's "true" value
($\mu_{gi}$).

The model shown in Equation 1 will be presented as a preferred
embodiment. Applications of the process using the model shown in
Equation 2, however, would be obvious to one skilled in the art.
Applications using other models (e.g., proportional offset and
additive random error) would also be obvious to one skilled in the
art.
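
For the additive model of Equation 2, one plausible way to carry out
the same steps is with arithmetic means. The sketch below is an
assumption, not something stated in the source: it uses the ordinary
least-squares estimates for a balanced two-way additive layout with the
array offsets constrained to sum to zero, and the function name
fit_additive_model and the example values are hypothetical.

```python
def fit_additive_model(x):
    """x[i][j]: intensity of element i on replicate array j, one condition.

    Returns (mu, v, resid) under X = mu_i + v_j + error, with the
    offsets v_j constrained to sum to zero (assumed estimator).
    """
    n, m = len(x), len(x[0])
    # Element estimate: arithmetic mean across replicate arrays.
    mu = [sum(row) / m for row in x]
    # Array offset: mean deviation of that array from the element means.
    v = [sum(x[i][j] - mu[i] for i in range(n)) / n for j in range(m)]
    # Residual: what remains after removing element value and array offset.
    resid = [[x[i][j] - mu[i] - v[j] for j in range(m)] for i in range(n)]
    return mu, v, resid

# Hypothetical noise-free data: three elements on three arrays.
true_mu = [10.0, 20.0, 30.0]
true_v = [-1.0, 0.0, 1.0]
x = [[mu_i + v_j for v_j in true_v] for mu_i in true_mu]
mu_hat, v_hat, resid = fit_additive_model(x)
```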

To make the parameters $v_{gj}$ identifiable in the model, the
restriction that $\sum_{j=1}^{m} \log(v_{gj}) = 0$ under Equation 1
[$\sum_{j=1}^{m} v_{gj} = 0$ under Equation 2] is required.

These parameters can be taken to be fixed or random. When the
parameters are assumed to be random, we assume further that they are
independent of the random errors.

Under the model shown in Equation 1, for example, we have the maximum
likelihood estimates (MLE) of $\mu_{gi}$ and $v_{gj}$ as follows:

$$\hat{\mu}_{gi} = \exp\left\{\frac{1}{m}\sum_{j=1}^{m}\log(X_{gij})\right\} \qquad (3)$$

and

$$\hat{v}_{gj} = \exp\left\{\frac{1}{n}\sum_{i=1}^{n}\left(\log(X_{gij}) - \log(\hat{\mu}_{gi})\right)\right\} \qquad (4)$$

Combining Equations 3 and 4 yields the estimate of the residuals
[$\log(\hat{\varepsilon}_{gij})$] shown in Equation 5.

$$\log(\hat{\varepsilon}_{gij}) = \log(X_{gij}) - \log(\hat{\mu}_{gi}) - \log(\hat{v}_{gj}) \qquad (5)$$

Because for given $g$ and $i$,

$$\log(X_{gij}) - \log(v_{gj}) = \log(\mu_{gi}) + \log(\varepsilon_{gij}),$$

$j = 1, \ldots, m$, are independent and identically distributed as a
normal distribution with mean $\log(\mu_{gi})$ and variance
$\sigma^2_{gi}$, Equation 6 provides unbiased estimates of array
elements' true values. That is, Equation 6 provides the estimated
values with systematic error removed.

$$\log(X_{gij}) - \log(\hat{v}_{gj}) \qquad (6)$$
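
The estimation steps of Equations 3 to 6 can be sketched in code for a
single condition. This is a minimal sketch under stated assumptions: it
reads Equation 3 as the geometric mean of each element across replicate
arrays and Equation 4 as the exponentiated mean log deviation of each
array from those element means, which is an interpretation of the
equations rather than a confirmed transcription. The function name
fit_proportional_model and the example data are hypothetical.

```python
import math

def fit_proportional_model(x):
    """x[i][j]: positive intensity of element i on replicate array j.

    Works on the log scale and returns
    (log_mu, log_v, log_resid, log_corrected).
    """
    n, m = len(x), len(x[0])
    log_x = [[math.log(value) for value in row] for row in x]
    # Equation 3 (as read here): mean log intensity of each element.
    log_mu = [sum(row) / m for row in log_x]
    # Equation 4 (as read here): mean log deviation of each array.
    log_v = [sum(log_x[i][j] - log_mu[i] for i in range(n)) / n
             for j in range(m)]
    # Equation 5: residuals on the log scale.
    log_resid = [[log_x[i][j] - log_mu[i] - log_v[j] for j in range(m)]
                 for i in range(n)]
    # Equation 6: log values with the array offset removed.
    log_corrected = [[log_x[i][j] - log_v[j] for j in range(m)]
                     for i in range(n)]
    return log_mu, log_v, log_resid, log_corrected

# Hypothetical noise-free data; offsets 0.5, 1, 2 have logs summing to zero.
true_mu = [100.0, 200.0, 400.0]
true_v = [0.5, 1.0, 2.0]
x = [[mu_i * v_j for v_j in true_v] for mu_i in true_mu]
log_mu, log_v, log_resid, log_corrected = fit_proportional_model(x)
```

With noise-free input the residuals are zero and the offsets are
recovered exactly; with real data the residuals are what the later
skewness and kurtosis checks examine.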

It is assumed that if the model is correct, the residuals should be
normally distributed. This assumption can be assessed empirically by
examining the skewness and the kurtosis of the distribution of the
residuals as calculated according to Equation 5 (skewness and kurtosis
measures are standard statistical indices; see Stuart & Ord,
"Distribution theory (6th ed.) (Kendall's advanced theory of
statistics Vol. 1)", New York: Halsted Press (1994)). Skewness is a
measure of the symmetry of a distribution. Kurtosis is a measure of
"peakedness" of a distribution. Under the normality assumption, both
skewness and kurtosis of the residual distribution should be
approximately zero.
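
As an illustration, skewness and kurtosis can be computed from the
central moments of the residuals. The sketch below assumes the
moment-based definitions and expresses kurtosis as excess kurtosis (the
fourth standardized moment minus 3), which is the convention under
which a normal distribution has kurtosis zero; the function names and
the example values are hypothetical.

```python
def central_moment(values, order):
    """Average of (value - mean) ** order over the sample."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n

def skewness(values):
    """Third standardized moment; zero for a symmetric sample."""
    m2 = central_moment(values, 2)
    return central_moment(values, 3) / m2 ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3; near zero for normal data."""
    m2 = central_moment(values, 2)
    return central_moment(values, 4) / m2 ** 2 - 3.0

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [0.0, 0.0, 0.0, 0.0, 10.0]
```

Both functions assume the sample is not constant, since the second
central moment appears in the denominator.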

Even if the model is correct for most of the data, outliers may cause
the distribution of the entire data set to deviate from normality.
Outliers can be detected and removed by one of the following
optimization procedures:

1. Outliers may be defined by a threshold (e.g., ±2 standard errors
   away from the mean of the residuals). In a preferred embodiment,
   any residual whose absolute value exceeds the threshold would be
   deleted from further statistical tests.
2. An automatic iterative process that examines skewness and kurtosis
   may also be used. In this procedure, skewness and kurtosis are
   calculated for a middle proportion of scores (e.g., the middle
   80%). Skewness and kurtosis are calculated repeatedly as the
   proportion of scores is increased in successive steps. The
   proportion of scores which produces optimal skewness and kurtosis
   values (i.e., closest to zero) is chosen as the optimal
   distribution of residuals. Scores which fall outside of the
   selected middle proportion of values are estimated as outliers. In
   a preferred embodiment, these scores are deleted from further
   analysis.
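
The two procedures above can be sketched as follows. Several details
are assumptions made for this sketch and are not specified in the
source: the threshold is applied as k sample standard deviations around
the mean, the middle proportion is taken symmetrically from the sorted
residuals, the step size is 1%, and "closest to zero" is scored as the
sum of the absolute skewness and absolute excess kurtosis. All function
names and data are hypothetical.

```python
import math
import random

def shape(values):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def threshold_outliers(residuals, k=2.0):
    """Procedure 1: flag residuals more than k SDs from the mean."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
    return [abs(r - mean) > k * sd for r in residuals]

def middle(ordered, proportion):
    """Central slice of a sorted list holding about `proportion` of it."""
    n = len(ordered)
    keep = min(n, max(3, round(n * proportion)))
    start = (n - keep) // 2
    return ordered[start:start + keep]

def best_middle_proportion(residuals, start=0.80, step=0.01):
    """Procedure 2: widen the middle proportion stepwise and keep the
    one whose skewness and excess kurtosis are jointly closest to zero."""
    ordered = sorted(residuals)
    n_steps = round((1.0 - start) / step)
    best = None
    for s in range(n_steps + 1):
        proportion = min(1.0, start + s * step)
        kept = middle(ordered, proportion)
        skew, kurt = shape(kept)
        score = abs(skew) + abs(kurt)  # assumed combined criterion
        if best is None or score < best[0]:
            best = (score, proportion, kept)
    return best[1], best[2]

# Simulated residuals: 200 well-behaved values plus three gross outliers.
rng = random.Random(1)
residuals = [rng.gauss(0.0, 1.0) for _ in range(200)] + [15.0, 18.0, -20.0]
proportion, kept = best_middle_proportion(residuals)
flags = threshold_outliers([0.1, -0.2, 0.0, 0.15, -0.1, 0.05, -0.05, 5.0])
```

Residuals outside the returned slice are the ones treated as outliers
under the second procedure.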

Statistical indices (e.g., confidence intervals) and statistical tests
(e.g., t-tests, analysis of variance) as described by Ramm and Nadon
in "Process for Evaluating Chemical and Biological Assays",
International Application No. PCT/IB99/00734, can then be applied to
the array element data whose residual scores are not outliers.

In addition or alternatively, the statistical test described in
Equations 7 and 8 can be applied to the data.

where $\hat{\sigma}^{*}$ for each condition is calculated as:

$$\hat{\sigma}^{*2} = \left[\mathrm{median}\{\,|x_i - \mathrm{median}(x)|\,\}\right]^2 \cdot c \qquad (8)$$

where $x$ = all residuals for all replicated array elements within a
condition and $c$ is a normalizing factor for estimating the standard
error of the residuals when they are normally distributed. Preferably,
$c$ = 1.0532, but other values of $c$ may be substituted.

The z* value from Equation 7 is examined relative to a standard normal
distribution (z-table) to assess level of statistical significance.
Equations 7 and 8 generalize to three or more conditions in a manner
that is obvious to one skilled in the art.
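
Looking up a z value in a standard normal table can be reproduced with
the Python standard library. The sketch below shows only that lookup;
it does not compute z* itself, and whether a one-tailed or two-tailed
probability is appropriate is an assumption left to the analyst. The
function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def p_one_tailed(z):
    """Probability of a standard normal value beyond |z| in one tail."""
    return 1.0 - STANDARD_NORMAL.cdf(abs(z))

def p_two_tailed(z):
    """Probability of a standard normal value beyond |z| in either tail."""
    return 2.0 * p_one_tailed(z)

def is_significant(z, alpha=0.05, two_tailed=True):
    """Compare the tail probability with a chosen Type I error rate."""
    p = p_two_tailed(z) if two_tailed else p_one_tailed(z)
    return p < alpha
```

For example, is_significant(2.5) holds at alpha = 0.05 while
is_significant(1.0) does not.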

The present invention does not preclude the use of prior art
normalization procedures being applied to the data before application
of the present process. This may be necessary, for example, when data
have been obtained across different conditions and different days.
Under this circumstance, data within conditions may need to be
normalized to a reference (e.g., housekeeping genes) prior to applying
the present process.

Although preferred embodiments of the invention have been disclosed
for illustrative purposes, those skilled in the art will appreciate
that many additions, deletions and substitutions are possible, without
departing from the scope or spirit of the invention as defined by the
accompanying claims.
APPENDIX

Consider a case in which expression data were gathered from three
replicate arrays that contained 1280 different elements. Systematic
error across replicate arrays is assumed to be proportional, and
random error across replicate arrays is also assumed to be
proportional. This model is shown in Equation 1 in the main body of
the text.

Normalization Method

One approach is to attempt to remove the proportional systematic error
by dividing each element within an array by a reference value (e.g.,
75th percentile value of all elements within the array). If systematic
error is removed by the normalization procedure, Equation 1 becomes:

$$X_{gij} = \mu_{gi}\, \varepsilon_{gij}$$

Residuals are then calculated according to Equation 5 with the term
for systematic error removed:

$$\log(\hat{\varepsilon}_{gij}) = \log(X_{gij}) - \log(\hat{\mu}_{gi})$$

Figure 1 presents the distribution of the residuals with skewness and
kurtosis optimized (i.e., closest to zero) and outliers deleted. Of
1280 residuals, 40 were detected as outliers and deleted. The skewness
and kurtosis values are -0.27, z = 3.88, p < .0001, and 0.0006,
z = .004, p = .49, respectively. The skewness value departs
significantly from zero, indicating that the residuals are not
normally distributed. This result suggests that, contrary to the
assumption of the model, normalization has not adequately removed the
systematic error component from the measured expression values.
Present Invention Method

In one preferred embodiment, the present invention would proceed as
follows:

1. Assume the measurement model shown in Equation 1.
2. Calculate the average of each element location across replicate
   arrays (Equation 3).
3. Estimate the systematic error for each array (Equation 4).
4. Calculate the residuals for each array element location
   (Equation 5).



Figure 2 presents the distribution of the residuals with skewness and
kurtosis optimized (i.e., closest to zero) and outliers deleted. Of
1280 residuals, 65 were detected as outliers and deleted. The skewness
and kurtosis values are .073, z = 1.04, p = .15, and 0.039, z = 0.28,
p = .39, respectively. The skewness and kurtosis values are not
significantly different from zero, indicating that the residuals are
approximately normally distributed. This result suggests that the
statistical modeling process has adequately removed the systematic
error component from the measured expression values.
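
The z statistics reported for the two methods appear consistent with
the usual large-sample standard errors of skewness and kurtosis,
sqrt(6/n) and sqrt(24/n). The sketch below checks this arithmetic. It
assumes that n is the number of residuals remaining after outlier
deletion (1240 and 1215), which the text does not state explicitly; the
function names are hypothetical.

```python
import math

def z_skewness(skew, n):
    """|skewness| divided by its large-sample standard error sqrt(6/n)."""
    return abs(skew) / math.sqrt(6.0 / n)

def z_kurtosis(kurt, n):
    """|excess kurtosis| divided by its large-sample standard error sqrt(24/n)."""
    return abs(kurt) / math.sqrt(24.0 / n)

n_normalization = 1280 - 40  # residuals kept by the normalization method
n_modeling = 1280 - 65       # residuals kept by the statistical modeling method

z_values = {
    "normalization skewness": z_skewness(-0.27, n_normalization),
    "normalization kurtosis": z_kurtosis(0.0006, n_normalization),
    "modeling skewness": z_skewness(0.073, n_modeling),
    "modeling kurtosis": z_kurtosis(0.039, n_modeling),
}
```

Rounded to the reported precision, these reproduce z = 3.88, .004, 1.04
and 0.28.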
Conclusion 

In this example, the procedures described by Ramm and Nadon in
"Process for Evaluating Chemical and Biological Assays", International
Application No. PCT/IB99/00734, or the procedures of the present
invention (Equations 7 and 8) would produce valid results with the
"Present Invention Method" but not with the "Normalization Method". In
other circumstances, depending on the measurement error model, prior
art normalization procedures may be adequate for this purpose (e.g.,
proportional systematic error across arrays with additive random
error). However, it is likely that the choice of the reference value
for the normalization procedure will be arbitrary from a statistical
inference perspective unless the processes are followed which are
described in the present document and in Provisional Patent
Application No. 60/082,692.
WHAT IS CLAIMED IS:

1. A method for improving the reliability of physical measurements
obtained from array hybridization studies performed on an array having
a large number of genomic samples, each composed of a small number of
replicates insufficient for making precise and valid statistical
inferences, comprising the step of estimating an error in measurement
of a sample by averaging errors obtained when measuring at least one
of the large number of samples and a subset of the large number of
samples, and utilizing the estimated sample error as a standard for
accepting or rejecting the measurement of the respective sample.

2. The method of claim 1 wherein a physical measurement quantity is
determined based on the difference between statistically dependent
quantities.

3. The method of claim 1 wherein a physical measurement quantity
determined from an entire array population is used to estimate
discrete instances of that quantity for the small number of replicate
samples within that population.

4. The method of claim 1 wherein the estimates of measurement error
are used to plan, manage and control array hybridization studies on
the basis of (a) the probability of detecting a true difference of
specified magnitude between physical measurements of a given number of
replicates, or (b) the number of replicates required to detect a true
difference of specified magnitude.

5. A method for improving the reliability and accuracy of physical
measurements obtained from array hybridization studies performed on an
array having a large number of genomic samples, each composed of a
small number of replicates insufficient for making precise and valid
statistical inferences, comprising the step of detecting outlier
values in the measurement of a sample by combining residuals of values
obtained when measuring one of the large number of samples and a
subset of the large number of samples.

6. The method of claim 5 wherein outliers are detected based on the
deviation of their residual values from one of the mean, median and
other measurement of the residual values.

7. The method of claim 5 wherein outliers are detected manually based
on characteristics, including skew and kurtosis, of the distribution
of residual values.

8. The method of claim 5 wherein outliers are detected based on at
least one of automatically and iteratively, with respect to the
characteristics, including skew and kurtosis, of the distribution of
residual values.

9. A method for improving the accuracy of physical measurements
obtained from array hybridization studies performed on an array having
a large number of genomic samples, each composed of a small number of
replicates insufficient for estimating offset across arrays, the
method comprising the step of averaging the differences between
individual samples within one array and the average of the certain
replicates across other arrays that include said one array.

10. A method for improving the accuracy of physical measurements
obtained from array hybridization studies performed on an array having
a large number of genomic samples across two or more conditions, each
composed of a small number of replicates insufficient for estimating
offset across arrays, wherein measurements obtained from certain
replicates are correlated across conditions, the method comprising the
step of averaging the differences between individual samples within
one array and the average of the certain replicates across other
arrays that include said one array.



11. The method of any one of claims 5-10 wherein a physical
measurement quantity is determined based on the difference between
statistically dependent quantities.

12. The method of any one of claims 5-10 wherein a physical
measurement quantity determined from an entire array population is
used to estimate discrete instances of that quantity for the small
number of replicate samples within that population.

13. The method of any one of claims 5-10 wherein the estimates of
measurement error are used to plan, manage and control array
hybridization studies on the basis of (a) the probability of detecting
a true difference of specified magnitude between physical measurements
of a given number of replicates, or (b) the number of replicates
required to detect a true difference of specified magnitude.

14. The method of any one of claims 1-10 used to evaluate physical
measurements obtained from biological and chemical assays conducted in
one of substrates, substrates containing wells, and test tubes.

15. The method of claim 11 used to evaluate physical measurements
obtained from biological and chemical assays conducted in one of
substrates, substrates containing wells, and test tubes.

16. The method of claim 12 used to evaluate physical measurements
obtained from biological and chemical assays conducted in one of
substrates, substrates containing wells, and test tubes.

17. The method of claim 13 used to evaluate physical measurements
obtained from biological and chemical assays conducted in one of
substrates, substrates containing wells, and test tubes.
Fig. 1. Normalization method.

Fig. 2. Statistical modeling method.